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Abstract —We address the problem of tracking the 6-DoF 
pose of an object while it is being manipulated by a human 
or a robot. We use a dynamic Bayesian network to perform 
inference and compute a posterior distribution over the current 
object pose. Depending on whether a robot or a human 
manipulates the object, we employ a process model with or 
without knowledge of control inputs. Observations are obtained 
from a range camera. As opposed to previous object tracking 
methods, we explicitly model self-occlusions and occlusions 
from the environment, e.g, the human or robotic hand. This 
leads to a strongly non-linear observation model and additional 
dependencies in the Bayesian network. We employ a Rao- 
Blackwellised particle filter to compute an estimate of the object 
pose at every time step. In a set of experiments, we demonstrate 
the ability of our method to accurately and robustly track the 
object pose in real-time while it is being manipulated by a 
human or a robot. 

I. INTRODUCTION 

Manipulation of objects is one of the remaining key 
challenges of robotics. In recent years, tremendous progress 
has been made in the area of data-driven grasp synthesis [2], 
Given an object, the goal is to infer a suitable grasp that 
adheres certain properties, e.g. stability or functionality. 
In many cases, this grasp is then performed in an open- 
loop manner without taking any feedback into account, e.g. 
in [18, 1, 4], This approach can lead to a very poor success 
rate especially in the presence of noisy and incomplete sensor 
data, inaccurate models, or in a dynamic environment. We 
have recently shown that the robustness of grasp execution 
can be greatly increased by continuously taking tactile sensor 
feedback into account [14]. This enables the robot to adapt 
to unforeseen situations. 

In this paper, we use visual sensing to continuously track 
the 6-DoF pose of an object during manipulation, which 
could enable the robot to adapt its actions according to 
the perceived state of the environment. In contrast to our 
previous work [14], such adaptation would not rely on being 
in contact with the object. Visual tracking is also crucial 
for precision manipulation tasks such as drilling a hole or 
inserting a key into a lock [17]. These tasks require precise 
alignment of a tool in the robot hand with objects in the 
environment. 

We present a real-time marker-less object tracking algo¬ 
rithm as a basis for these kinds of systems. We consider the 
movement of an object as a stochastic process and model it 
in a dynamic Bayesian network. We perform inference in this 
network to compute a posterior distribution over the current 
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Fig. 1. The robot is tracking an impact wrench (left image) in real-time 
using a depth sensor (red object model in the right image). Fiducial markers 
are used for baseline comparison only. 


object pose. We follow the general paradigm of a Bayes 
filter [19] in which we (i) predict the current object pose 
given the previous pose and then (ii) update the prediction 
given an observation. For the first step, we use a process 
model that can either be dependent on control inputs (in case 
the robot is moving the object) or be based on the simple 
assumption that the object pose will not change significantly 
in a short time (in case the object is not being held by the 
robot). We use a range camera to obtain dense depth images 
of the scene. For the update step, we develop an observation 
model that, given the current estimate of the object pose, 
determines the likelihood of the observed depth data. We 
explicitly model self-occlusions and occlusions from the 
environment. Therefore, the algorithm becomes more robust 
to these effects. However, the observation model is strongly 
non-linear and new dependencies are introduced in the Bayes 
network. We therefore employ a Rao-Blackwellised particle 
filter [5] to compute a posterior distribution over the current 
object pose. 

The paper is structured as follows. In the next section, we 
review existing approaches to object tracking, especially in 
the context of robotic manipulation. We then briefly review 
the Bayes filter and associated inference methods in Sect. III. 
In Sect. IV and V the observation and process models are 
derived. In Sect. VI we formulate the proposed algorithm. 
Experimental results are presented in Sect. VII. Finally, we 
conclude and present ideas for future work in Sect. VIII. 

II. RELATED WORK 

Existing approaches to real-time 6-DoF pose tracking 
can be divided into two groups according to the type of 
data used. The first group consists of algorithms which 
mainly rely on 2D images, e.g. [7, 11, 3]. This kind of 
approaches may be sensitive to the amount of texture on 
the tracked object and in the scene as well as to lightning 




conditions. It has been shown that using depth data can add 
robustness to tasks such as SLAM [13], object detection 
and recognition [12, 9]. Therefore it is worth exploring the 
second group of methods for 6 DoF pose tracking that rely 
on range data. In this section, we review those which aim for 
real-time performance and use dense depth data as sensory 
input. 

Ren and Reid [16] represent the object surface as the zero 
level in a level-set embedding function. Specifically, it is 
based on a truncated 3D distance map. The pose of the 
object is found by minimizing over the sum of all points 
in the 3D point cloud back-projected into the object frame 
and evaluated in the embedding function. Optimization is 
performed through the Levenberg-Marquardt algorithm. The 
GPU implementation of this approach performs in real-time. 
Through the choice of a robust estimator and a robust variant 
of the 3D distance map it seems to work well even in 
the presence of heavy occlusions. There are however no 
experiments provided where an occluding object is in contact 
with the tracked object. It is not clear how the algorithm 
would perform in that situation since points from the oc¬ 
cluding object might be mistaken for the tracked object. Our 
method, in contrast, is formulated in a Bayesian framework 
that allows us to fuse different sources of information in a 
probabilistic manner, we fuse for example the knowledge of 
the control inputs with the depth measurements in order to 
estimate the pose of the object. Furthermore, we explicitly 
model occlusions in this framework and finally, we obtain 
a posterior distribution over the pose instead of just a point 
estimate. 

Hebert et al. [8] fuse sensory data from stereo cameras, 
monocular images and tactile sensors for simultaneously 
estimating the pose of the object and the robot arm with 
an Unscented Kalman Filter (UKF). Using all these sensor 
modalities, the approach allows to track the object well while 
it is being held by the robot. It is not clear how well it 
works when the object is not being held by the robot, since 
in that case less information about the motion is available. 
Our method uses the knowledge of the control inputs as well, 
when available, but it is able to track an object even when 
it is not being held by the robot. 

Ueda [20] uses a sampling-based approach. The object is 
represented as a partial point cloud and tracking is performed 
using a particle filter. The likelihood of an observation is 
computed using a function of the squared distance between 
each point in the object model and its closest point in the 
observed point cloud, as well as the distance between their 
respective colors. This approach neither models noise in the 
sensor nor occlusion. Since in terms of experimental results 
there is only a video available, it is not clear how well this 
method performs in general settings. 

In contrast to the mentioned algorithms, occlusion is 
modeled explicitly in our observation model. As we will 
show in Sect. IV in more detail, this leads to strong non- 
linearities. We therefore use a sampling-based approach to 
filter the pose of the object. This has the advantage that even 
non-Gaussian multi-modal distributions can be represented. 


The main drawback of sampling-based approaches is that 
they are computationally expensive. Nevertheless we show in 
the experimental section that we achieve real time tracking 
using only one CPU core. 

III. Bayes Filter 

In this section, we briefly discuss the Bayes filter and a 
number of techniques to perform inference in cases where 
the problem cannot be solved analytically. 

The assumptions made in a Bayes filter are twofold. 
Firstly, the Markov assumption asserts that each state only 
depends on the previous state. Secondly, it is assumed that 
the state is sufficient to predict the (potentially noisy) obser¬ 
vation [19]. These independence assumptions are represented 




Fig. 2. This graph represents the independence assumptions made by a 
Bayes filter. 

in a probabilistic graphical model (PGM) in Fig. 2. The goal 
in state estimation is to find the posterior p(xt \z\-t, ui-.t) over 
the state Xt given all the observations Z\-t and the control 
inputs u\-t- There are different approaches for performing 
inference in this model that depend on the form of the 
process model p(xt\xt-i, ut) and the observation model 
p(zt\xt). For a linear process and observation model and 
each being subject to Gaussian noise, the inference problem 
can be optimally solved with a Kalman Filter. 

If the process and observation model can be approximated 
as being locally linear and the noise can be assumed to 
be Gaussian, then an Extended Kalman Filter (EKF) or an 
Unscented Kalman Filter (UKF) can be used. In the EKF, 
the process and observation model are linearized to compute 
the covariance matrix of the current state estimate. In the 
UKF, samples around the mean of the current state estimate 
or prediction are drawn and pushed through the non-linear 
models. The projected samples serve as the base to estimate 
the Gaussian posterior distribution. 

Finally, there are nonparametric methods to solve the 
inference problem. The most well known is the particle 
filter which represents the posterior over the state at time 
t, p(x t \zi :t ,ui. t ), by a set of samples {x[ 1 ^} which are dis¬ 
tributed accordingly. Such a set of samples can be obtained 
by sampling from the distribution over the entire sequence 
of states p(xi : t\zi:t,ui-.t) and then dropping all the previous 
states Xi-.t-i- Consequently, the previous states do not have 
to be marginalized out in a particle filter, we can thus express 
p{xi:t\zi:t,u 1:t ) instead of p(x t \z 1:t , u 1:t ) [19], 

As will be discussed in more detail in Sect. IV, we 
introduce a variable o t describing which parts of the image 




are occluded, therefore our state x t consists of the 6-DoF 
pose and the occlusion ( rt,Ot ). The posterior distribution 
can then be written as 

= p(o 1:t \r 1:t ,z 1:t ,u 1:t )p(r 1:t \z 1:t ,u 1:t ). 

As a result of the functional form of our observation model, 
derived in Sect. IV, and our process model, discussed in 
Sect. V, the variables 0\ :t can be marginalized out analyti¬ 
cally while the variables ri :t cannot. In a dynamic Bayesian 
network with these properties, inference can be performed 
using a Rao-Blackwellised particle filter [5]. Integrating out 
the previous occlusions oi : t-i we obtain 

p(ri:t, 0 t \zi:t, Ui :t) = p{o t \r 1:t , Z 1:t , U 1:t ) ( 1 ) 

Since the variables r 1:t _x cannot be integrated out analyti¬ 
cally, the second term is represented, as in a common particle 
filter, with a set of samples. In a Rao-Blackwellised particle 
filter the posterior over the full state p(rt, Ot\zi-t,ui-.t) is 
thus represented by a set of particles r\ l \ distributed accord¬ 
ing to p(ri : t\zi : t,Ui;t), each associated with a probability 
P(o t \r[ l ) t , z 1:t ,u 1:t ). 

Before we apply this method to object tracking we will 
derive the observation and the process model. 

IV. OBSERVATION MODEL 

The objective is to infer the 6-DoF pose r t of an object 
assuming that we have a 3D model. The observation model 
p(zt\r t ) expresses what depth image we expect to observe 
given the pose of the object. However, if it is not known 
whether the object is occluded or not, we cannot predict the 
depth which should be measured. Therefore, we introduce 
a set of binary variables o t = {o\ } modelling occlusion. 
o\ = 0 means that the object is visible at time t in pixel 
i, whereas o\ = 1 indicates that it is occluded. Thus, the 
full state Xt = ( rt,Ot ) consists of the 6-DoF pose and the 
occlusion for each pixel i. The graphical model in Fig. 2 can 
be expanded as shown in Fig. 3. The depth measurements 



Fig. 3. This graph represents the independence assumptions made by a 
our filter. The box is a plate which represents the I pixels. 

{ z\ } at each pixel (including the noise) are assumed to be 
independent given the state. This assumption is reasonable 
since the pose and the occlusion variables allow us to predict 
the observation. Furthermore, we assume the occlusions o\ 


of different pixels to be independent. Although we thereby 
ignore relations between neighboring pixels, inference be¬ 
comes more efficient as will be described in Sect. VI. 

We can now write p(z t \r t ,o t ) = Y\iP{ z t\ r ti o l t ) as the 
product of measurement likelihoods at each pixel i given 
the pose and whether the object is occluded in pixel i. 
By marginalizing out the occlusion variables p{z\\r t ) = 
Yl 0 i P( z t\ r ti °t)p(°t)’ we obtain a model which is closely 
related to beam-based models discussed in robotics literature 
[19]. These consist of a weighted sum of an observation 
model assuming that the object is occluded and an ob¬ 
servation model assuming that it is visible. The important 
difference to our approach is that we continuously estimate 
the probability p{o\) of the object being occluded whereas 
in [19] this is a parameter which is set off-line and kept 
constant during execution. 

To express the observation model p(z\\rt, o\ ), we represent 
the measurement process in Fig. 4, which is a subpart of the 
graphical model in Fig. 3. Two auxiliary variables a 1 and b l 



Fig. 4. This graph shows the measurement process of the camera. It is a 
subpart of the graph in Fig. 3, zoomed in on one observation. 


have been introduced. These variables will be integrated out, 
but they are convenient for the formulation of the observation 
model. Since we are looking at only one time step, the time 
index is omitted here, a 1 is the distance to the tracked object 
along the beam defined by pixel i and b, is the distance to 
the object which is seen in pixel i\ this might not be the 
tracked object, depending on whether it is occluded. 

We can now describe the process which leads to the 
observation. Given the pose r, we can compute the distance 
to the tracked object a 1 for a given pixel i. Then given the 
distance to the tracked object and the information whether it 
is occluded, we can predict the distance to the object b l seen 
in pixel i, which might or might not be the tracked object. 
Knowing this distance, it is easy to predict a measurement 
z l . Using the independence assumptions from Fig. 4, this 
process can be formally written as 

p(z l \r,o l )= f p{z l \b l )p(b l \a\o l )p{a l \r). (2) 

J a,b 

In the following, we express each of these terms going from 
right to left. p(a l \r) is the probability distribution over the 
distance to the tracked object given the pose. This is simply 
the distance d l (r) to the intersection of the beam defined by 
pixel i with the object model in pose r. Additionally, there 
is some noise due to errors in the object model: 

p(a l \r) = N(a l \d\r),a m ). (3) 















p(b l \a l ,o l ) expresses the distance to the object measured 
in pixel i given the distance to the tracked object and the 
occlusion. If the o l = 0 the object is visible, therefore a 1 
and b l have to be identical. Otherwise h, has to be smaller 
than a,i, since the occluding object is necessarily in front of 
the tracked object. Formally we can write this as 


p(b i \a i ,o i ) = 


(4) 


bib 1 -a 1 ) if o® = 0 

I{b l > 0 A b l < d 1 ) if o i = 1 

where I is the indicator function which is equal to one if 
the condition is true and zero otherwise. 6 denotes the Dirac 
delta function. We assume that the probability of a beam 
intersecting with an object other than the tracked object is 
equal for any interval of equal length. This implies that the 
probability of the beam hitting the first object at a distance b 1 
decays exponentially if the observed object is not the tracked 
object, see Eq. 4. The parameter A is a function of the half- 
life, i.e. the distance at which we expect half of the beams 
to not have intersected with an object yet. The algorithm 
is not sensitive to this parameter, therefore we simply set 
the half-life to a number which seems reasonable. In all our 
experiments we used 1 m. 


Finally, we express which is the distribution over 

the measurementgiven the distance of the observed object 
to the camera. The difference in these two quantities is due 
to noise in the measurement of the range camera. We used an 
Asus XTION Pro depth sensor in our experiments and thus 
attempted to model its noise. Khoshelham and Elberink [10] 
estimated the noise in the depth measurements of a Microsoft 
Kinect which is based on the same design as the sensor we 
used. The authors showed that the noise increases with the 
depth squared. We define the camera standard deviation <j c 
accordingly. Fallon et al. [6] model the noise in the Kinect 
camera by the sum of a Gaussian distribution and a constant 
term to shift more weight to the tails than in a purely normal 
distribution. We also found it to be advantageous to have a 
heavy tailed distribution and thus define the distribution in a 
similar fashion as 


K^Hl- /3)A^ i |6> c ) + /3 


I{z l > 0 A z i < m) 


(5) 


where I is the indicator function and m is the maximum 
depth which can be measured by the range camera, 6 m in 
our case. In all our experiments the weight of the tails /3 was 
set to 0.01, as proposed in [6] and a c was defined according 
to [10], 



Fig. 5. This graph shows the probability of an observation given a pose 
p(z z \r) = 0 , p{z' \r. o l )p{o r ). The pose r is such that the distance to 
the object model d 1 (r) is equal to 1 m. There are five curves for different 
probabilities of the object being visible p(o® = 0). 

probability of measuring a depth which is smaller than the 
predicted depth diminishes as we become more certain that 
the object is not occluded. 

V. PROCESS MODELS 

The pose process model p(rt\rt-i, Ut) describes how 
we expect the object pose to change over time given the 
last pose and the control inputs. The occlusion process 
model piollol-i) describes how the occlusions propagate. In 
this section we will model these two probability distributions. 


A. Pose Process Model 

The process model for the pose p(r t |r t _i, u t ) depends of 
course on the underlying process. In our experiments we test 
two different scenarios. In the first case where the object is 
moved by a human arm, we do thus not have access to the 
control inputs and therefore simply propagate the pose by a 
small random rotation and translation drawn from a Gaussian 
distribution. 

In the second case, the object is moved by a robot and 
we thus make use of the control inputs. From the control 
inputs, we can compute the expected velocity of the object 
being held for every time step. Since this velocity is not 
entirely accurate, we model the noise as being Gaussian in 
the translational as well as the rotational velocity. When the 
object is being moved by a robot, the uncertainty is of course 
much smaller than when it is being moved by a human arm. 

B. Occlusion Process Model 


Now the observation model is fully specified. We plot 
the likelihood of the pose p(z l \r) = p(~*l r > o1 )p{° 1 ) i n 

Fig. 5 as a function of the depth measured in pixel i , given 
that the predicted distance to the object d l {r ) is equal to 
1 m. For a small probability of the object being visible the 
function is almost constant in observed depth z l with a steep 
decrease at the predicted depth d l (r) = 1. This intuitively 
makes sense since the occluding object has to be closer to 
the camera than the tracked object. As the probability of the 
object being visible increases, a more and more pronounced 
spike appears at the predicted depth. At the same time the 


The distribution ^(ojlo^) has only two parameters since 
the variables are binary. We have to determine the probability 
of the object being visible in pixel i given that it was visible 
in the last time step, p{o\ = Oloj^ = 0) and the probability 
that the object is visible given that is was occluded in the 
last time step p{o\ = 0| o\_ 1 = 1). In all our experiments we 
set p{o\ = 0| o\_ x = 0) = 0.9 and p{o\ = 0| o l t _ 1 = 1) = 0.3 
for a time delta of one second. This represents the fact that 
if the object was observed in the last time step we expect 
to observe it again in the current time step. The algorithm 
proved to not be very sensitive to these parameters, so their 


















values were simply set by hand without any tuning. A more 
systematic estimation is the subject of future work. 

VI. FILTERING ALGORITHM 

Since we have introduced a new set of variables, we do 
not only infer the pose of the object r t but also the occlusion 
o t . As can be seen from Fig. 5, our observation model 
is highly nonlinear and results in a likelihood of the pose 
p(zt\rt) which does not fit well to a normal distribution. 
With a Gaussian distribution, we can essentially express that 
a variable should be somewhere close to a certain value. 
However as discussed in Sect. IV, for the proposed algorithm 
it is central to also be able to express that a variable should 
be larger than a certain value. This requirement discards 
algorithms which approximate the p{zt\rt) with a normal 
distribution such as KF, EKF and UKF. Therefore, we use 
sampling to solve the inference problem for the pose r t . 
As shown below, marginalizing out the occlusion variables 
Ot. is tractable. We can thus perform inference using a 
Rao-Blackwellised particle filter as described in Sect. III. 
Rewriting Eq. 1 for convenience, we have 

p{ri:t, o t \zi:t, ui:t)=p(o t \r 1:t , Z 1:t , Ui, t )p{r 1 , t \zi :t , Ui:t) (6) 

We are mainly interested in the second term which expresses 
the distribution over the poses given all the measurements 
and control inputs. As we will see shortly, the estimate of the 
pose depends on the estimate of the occlusion i.e. the first 
term in Eq. 6. In the following, we will therefore express 
these two terms recursively. 

From Fig. 3 we can see that the first term in Eq. 6 factor¬ 
izes as p{o t \r t ,z 1:t ,u 1:t ) = YliP(°i\ r i-U Zi-.t, u 1:t ). Taking 
the independence assumptions from Fig. 3 into account, we 
can express each factor in a recursive form. 

p(o l t \r 1:t ,z 1:t ,u 1:t ) = (7) 

T,oi_ 1 [p(4\n,oi)p(oi\oi- 1 )p{oi- 1 \r 1: t-i,z 1:t -i,u 1 ..t-i)] 

Y^oi,oi_lp(4\ r t,o i t )p(o i t \oi_ 1 )p{ol_ 1 \r 1:t _ 1 ,z 1:t _ 1 ,u 1:t _ 1 )] 

where p(z\\r t ,o\) represents the observation model (see 
Sect. IV), and p{o\\o\_ 1 ) is the occlusion process model (see 
Sect. V). 

Taking the independence assumptions from Fig. 3 into 
account, we can write the second term from Eq. 6 as 

p(ri:t\Zl:t,U 1: t) OC (8) 

p(zt\ri : t,zi:t-i)p(rt\rt-i,ut)p(ri : t-i\zi:t-i,u 1 :t -i) 
where p(rt\r t -i,ut) is the pose process model as discussed 
in Sect. V. We can obtain a set of samples distributed 
according to p(ri-.t\zi:t,ui : t) by taking the samples from 
the previous time step, propagating them with the process 
model p(rt\rt-i,ut) and resampling using the likelihoods 
p{ z t\ r i:t, Zi-.t-i)- In a common particle filter the likelihood 
p{ z t\r\-.t, Zi-.t-i) reduces to p{z t \rt), but here it depends on 
the estimate of the occlusion: 

p{zt\r 1:t ,z 1: t-i) = (9) 

T. p{z t \r t ,o t )p(o t \o t - 1 )p(ot-i\r 1:t -i,z 1:t -i,u 1:t -i). 


The sums which marginalize out the occlusion variables 
contain 2 1 terms, with / being the number of pixels. This 
is of course intractable, but given our assumptions, all the 
terms inside of the sum factorize over the pixels i, and we 
can move the sum inside of the product. 

p(zt\ri:t, Zl:t-l) = (10) 

n z>( 

* °h°l-i 

Now we only have to sum I times over 4 terms. The last 
term is identical to Eq. 7, just shifted by one time step. The 
estimate of the pose thus depends on the estimate of the 
occlusion through the likelihood. 

The initialization of the tracking is not a focus of this 
paper. In practice we tackle this problem by putting the 
object in a configuration where we have a strong prior over 
its pose, for example when it is standing on a table or 
being held by the robot. We then initialize the filter with 
a very large number of particles sampled from this prior. 
The initialization could also come from a different algorithm 
which is better suited for a global search. 

Now the algorithm is fully defined: 

• From the previous time step we have a set 

of particles distributed according to 

p(ri:t-i\zi:t-i,ui : t-i) and for each of these particles 
we know p(c>t_ 1 |r^_ 1 , Zi-.t-i, Furthermore 

we know the control ut which is applied during the 
current time step and we observe a depth image z t . 

• For each particle in {r’i^ t _ 1 } 

- We draw a sample from the pose process model 

— We compute the likelihood p(z t \r[ l ) t , Zi-.t-i) ac¬ 
cording to Eq. 10. 

- We update the occlusion probabilities 

p( 0 l\ r i) t , z ut, u v.t) for each pixel according 
to Eq. 7. 

• We resample the particles according to the likelihoods. 
We thus now have a set of particles {r®} distributed 
according to p{ri-.t\zi : t,ui-.t), and the corresponding 
occlusion probabilities p(o\ |r^.|, zi-.t, We can now 
go to the next time-step and repeat the procedure above. 

VII. EXPERIMENTAL RESULTS 

In the following, we describe our experimental setup and 
present results. Due to the difficulty in obtaining ground truth 
information, we use a previously implemented fiducial-based 
object tracker as a baseline method. We want to emphasize 
that our approach does not rely on the presence of these 
fiducials. 

A. Experimental Setup 

Our experiments are based on the dual arm manipulation 
platform shown in Fig. 1. The head is actuated by two 
stacked pan-tilt units, and consists of the following sensors: 
(a) an Asus Xtion PRO range camera, (b) a Point Grey 



Bumblebee2 stereo camera, and (c) a Prosilica high reso¬ 
lution color camera. These cameras are calibrated w.r.t each 
other using an offline calibration procedure. The robot has 
two 7-DoF Barrett WAM arms, each equipped with a 3- 
fingered Barrett Hand as the end-effector. The WAM arms are 
cable driven, and joint positions are measured using absolute 
encoders on the motors and not the joints themselves. The 
variable stretch of the cables depending on the robot pose 
and payload imply that the kinematic model of the arm 
is not accurate enough to perform manipulation tasks with 
sufficient precision [15]. 

We use the Asus Xtion as the sensor for our algorithm, 
which provides depth images at a rate of 30 Hz. The 
depth images produced by this camera have a resolution 
of 640x480, for our purposes however an image which has 
been downsampled by a factor of 5 proved to be detailed 
enough. Our observation is thus an array of size 128x96 
containing the depth measurements from the camera. We 
would like to stress that apart from the downsampling 
no preprocessing whatsoever is required by the proposed 
method. Our implementation currently uses a single core of 
a 3.3GHz Intel Xeon W5590 CPU, and is able to sample 
and evaluate 200 poses in real-time per camera frame. We 



Fig. 6. The fiducial tracker uses camera images to track the round fiducial 
markers (left). The obtained pose estimate (green) is used as a baseline 
comparison. The forward kinematics model (blue) is misplaced because 
of deliberate disturbances (small image). The proposed approach achieves 
accurate tracking (red) even in the presence of occlusions. 


compare the results from our method with a previously 
implemented baseline method which tracks objects based 
on a known pattern of fiducials on the object, using the 
Bumblebee2 stereo camera. This method is based on an EKF 
which maintains a distribution over the pose of the object. 
Fiducials are detected in each camera image using a local 
template-matching approach 1 (see Fig. 6 left). The template 
for each fiducial depends on the projection of the expected 
fiducial pose in the camera. The 3-D position of each fiducial 
is then reconstructed from their 2-D positions in each camera. 
The full 6-DoF object pose is solved for by minimizing the 
squared error between the detected fiducial positions and 
their corresponding positions in the object model. The object 
pose thus obtained is treated as an observation which updates 
the pose distribution in the EKF. This EKF uses the same 

'The fiducial detector was kindly provided by Paul Hebert. Jeremy Ma, 
and Nicolas Hudson from the Jet Propulsion Laboratory. 


process models as the ones described in Sect. V, depending 
on whether the object is in the robot hand or not. Finally, we 
also contrast these visual tracking methods with estimates 
of the object pose based on the kinematic model of the 
robot. The position of the hand is computed using the motor 
encoders. We compute a fixed offset between the hand and 
the object at the time of grasping, and assume that this offset 
remains constant for the entire motion. 

B. Results 


Our algorithm was evaluated in three scenarios: (a) object 
manipulated by a human (Fig. 6 left), (b) object manipulated 
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Fig. 7. Scenario (a): 6-DoF trajectories of the object being manipulated 
by a human hand, as tracked by the proposed method (red) and the baseline 
fiducial tracking method (green). The fiducial tracking method is susceptible 
to errors when fiducials are occluded (e.g., from £=33 to £=37), while our 
method is more robust to occlusions. 
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Fig. 8. Scenario (b): 6-DoF trajectories of the object being manipulated by 
the robot with a secure grasp. The errors in the forward kinematics method 
(blue) are due to imprecise joint position sensing under load. The proposed 
approach (red) followed the baseline (green) closely. 

by the robot with a secure grasp, and (c) object manipulated 
by the robot (Fig. 6 right), with a human disturbing the 
pose of the object in the hand. Videos of each of these 
scenarios may be found in the multimedia attachment or 
online at http://youtu.be/MBgggaJqlsY. 

Figure 7 shows the position (x, y, z) and orientation (roll, 
pitch, yaw) trajectories of the object being moved by a 
human, as tracked by the baseline fiducial tracking method 
(shown in green) and our proposed method (shown in red). 
A qualitative assessment reveals that the fiducial tracking 
method is susceptible to errors when fiducials are occluded 
(e.g., from /=33 to f=37), while the proposed method tracks 


Fig. 9. Scenario (c): 6-DoF trajectories of the object being manipulated 
by the robot, while being disturbed by a human. Changes in the pose of 
the object in the hand cannot be tracked by the forward kinematics method 
(blue). Our marker-less approach (red) successfully tracked the object and 
followed the baseline fiducial-based approach (green) closely. 

the object pose more robustly in spite of occlusions. 

Results from scenario (b) where the object is manipulated 
by the robot with a secure grasp are shown in Figure 8. In 
addition to the baseline method and the proposed method, we 
also show pose estimates of the object based on the robot 
kinematic model (in blue). This estimate significantly devi¬ 
ates from the two visual tracking methods due to undetected 
stretching of the Barrett WAM cables, which is exacerbated 
by the weight of the object in the hand. Additionally, we 
introduced external disturbances that did not affect the pose 
of the object with respect to the hand. Such errors in the 



























































forward kinematics model render precise manipulation and 
tool use impossible without visual tracking and control. 

Figure 9 shows results from the final scenario, in which 
the object is manipulated by the robot, while the pose of 
the object in the hand is being disturbed by a human. By 
definition, manipulation tasks involve contact of the object 
in the hand with the external world. These contacts often 
result in slippage of the object in the hand, and in such cases 
visual tracking can be tremendously beneficial for successful 
completion of the task. Similar to the previous scenario, we 
notice that the two visual tracking methods perform equally 
well. In contrast, the forward kinematics method shows large 
errors as it cannot account for movement of the object in the 
hand (see Fig. 6 right). 

VIII. CONCLUSION AND FUTURE WORK 

We have presented a probabilistic approach to object track¬ 
ing using a range camera. Occlusions are explicitly modeled 
in our approach, which adds robustness and removes the need 
to filter out points belonging to the robot. Our method is fast 
enough for real-time tracking performance on a single core 
of a modern computer, and the sampling process trivially 
lends itself to parallelization. 

In future work, we plan to apply this method to tracking 
the posture of articulated objects such as the robot arm itself 
to accurately estimate the arm kinematic configuration. We 
also plan to apply the probabilistic models developed in this 
work to alternate sensor modalities such as tactile sensors. 
Finally, we will use the presented object tracking approach 
to further close perception-action loops as in [14]. 
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